Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Sparse int64 and bool dtype support enhancement #13849

Merged
merged 1 commit into from Aug 31, 2016

Conversation

sinhrks
Copy link
Member

@sinhrks sinhrks commented Jul 30, 2016

Currently, sparse doesn't support int64 and bool dtypes actually. When int or bool values are passed, it is coerced to float64 if dtypekw is not explicitly specified.

on current master

pd.SparseArray([1, 2, 0, 0 ])
# [1.0, 2.0, 0.0, 0.0]
# Fill: nan
# IntIndex
# Indices: array([0, 1, 2, 3], dtype=int32)

pd.SparseArray([True, False, True])
# [1.0, 0.0, 1.0]
# Fill: nan
# IntIndex
# Indices: array([0, 1, 2], dtype=int32)

after this PR

The created data should have the dtype of passed values (as the same as normal Series).

pd.SparseArray([1, 2, 0, 0 ])
# [1, 2, 0, 0]
# Fill: 0
# IntIndex
# Indices: array([0, 1], dtype=int32)

pd.SparseArray([True, False, True])
# [True, False, True]
# Fill: False
# IntIndex
# Indices: array([0, 2], dtype=int32)

Also, fill_value is automatically specified according to the following rules (because np.nan cannot appear in int or bool dtype):

Basic rule: sparse dtype must not be changed when it is converted to dense.

  • If sparse_index is specified and data has a hole (missing values):
    • fill_value is np.nan
    • dtype is float64 or object (which can store both data and fill_value)
  • If sparse_index is None (all values are provided via data, no missing values)
    • if fill_value is not explicitly passed, following default will be used depending on its dtype.
      • float: np.nan
      • int: 0
      • bool: False

@sinhrks sinhrks added Enhancement Dtype Conversions Unexpected or buggy dtype conversions Sparse Sparse Data Type labels Jul 30, 2016
@sinhrks sinhrks added this to the 0.19.0 milestone Jul 30, 2016
@codecov-io
Copy link

codecov-io commented Jul 30, 2016

Current coverage is 85.27% (diff: 98.63%)

Merging #13849 into master will increase coverage by <.01%

@@             master     #13849   diff @@
==========================================
  Files           139        139          
  Lines         50511      50523    +12   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43071      43083    +12   
  Misses         7440       7440          
  Partials          0          0          

Powered by Codecov. Last update 10bf721...341585a

@sinhrks sinhrks force-pushed the sparse_dtype3 branch 2 times, most recently from 3ab5f3e to d500761 Compare August 4, 2016 11:48
jreback pushed a commit that referenced this pull request Aug 4, 2016
split from #13849

Author: sinhrks <sinhrks@gmail.com>

Closes #13900 from sinhrks/sparse_astype and squashes the following commits:

1c669ad [sinhrks] ENH: sparse astype now supports int64 and bool
@sinhrks sinhrks force-pushed the sparse_dtype3 branch 7 times, most recently from b49c1c8 to 4a7c84b Compare August 6, 2016 19:29
@jreback
Copy link
Contributor

jreback commented Aug 7, 2016

@sinhrks getting tons of warnings compiling on windows....all the same

pandas\src\sparse.c(63861) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(63870) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(66180) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(66189) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(68499) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data

@sinhrks sinhrks force-pushed the sparse_dtype3 branch 2 times, most recently from 6c4e0ee to 4eacbec Compare August 8, 2016 11:21
jreback pushed a commit that referenced this pull request Aug 9, 2016
@sinhrks sinhrks force-pushed the sparse_dtype3 branch 6 times, most recently from c334402 to 21861f0 Compare August 16, 2016 22:37
@jorisvandenbossche
Copy link
Member

Sorry, not familiar with sparse. But: using object dtype, does it work enough to use it for certain cases? If yes, I would not remove it.

@jorisvandenbossche
Copy link
Member

@sinhrks Does this also close #13110?

@sinhrks
Copy link
Member Author

sinhrks commented Aug 21, 2016

I think object dtype can be used in some cases, but not fully sure as it is not tested well. Not remove ATM and add more tests to clarify (on another PR).

#13110 should be closed. Added whatsnew.

@@ -17,6 +17,7 @@ Highlights include:
- ``.rolling()`` are now time-series aware, see :ref:`here <whatsnew_0190.enhancements.rolling_ts>`
- pandas development api, see :ref:`here <whatsnew_0190.dev_api>`
- ``PeriodIndex`` now has its own ``period`` dtype. see ref:`here <whatsnew_0190.api.perioddtype>`
- Sparse now supports other ``int`` and ``bool`` dtypes, see :ref:`here <whatsnew_0190.sparse>`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would leave out other

@jorisvandenbossche
Copy link
Member

Disclaimer: I never used sparse or am familiar with the implementation (so my excuses if it is a stupid or naive question), but I quickly looked at the PR and have the following question.

Previously, for integer and boolean serieses, the 0 or False values were regarded as actual values, not an indication of 'not a value' in the sparse series. Isn't this a big change? (I don't know how much you could use it before this PR to be a problem)
Next to that, having eg False for boolean arrays as the default fill_value also seems a bit strange to me. I would expect that somebody who wants a boolean sparse array, would want to be able to have both True and False values as actual values? (eg something like [True, -, -, False, -, -, True])?
Of course this is currently because boolean serieses cannot have anything else as True or False.

@jorisvandenbossche
Copy link
Member

OK, so probably my question should be categorized in the naive category :-)
I see that this is the same as what scipy.sparse does, so seems like a sensible default then.


Sparse data should have the same dtype as its dense representation. Currently,
``float64``, ``int64`` and ``bool`` dtypes are supported. Depending on the original
dtype, ``fill_value`` default changes:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a note here somewhere that for int and bool this was only added from 0.19 ?

@jreback
Copy link
Contributor

jreback commented Aug 27, 2016

joris your example already works you can have any values u want as actual values (both True and False); the fill value is for the missing value indicator when I need to densify (it's the default)

so this is not a conceptual change at all just a change to keep dtype consistency

@jorisvandenbossche
Copy link
Member

@jreback I was looking at the to_sparse examples. So the fill_value is also used to convert from dense to sparse. So the output what you see there (eg in case of pd.Series([1, 0, 0]).to_sparse()) has changed (previously that was a block length of 3, now of 1). But no problem, I understand that the actual behaviour you want has not changed.

@jorisvandenbossche
Copy link
Member

@jreback This PR for the rest OK to merge for you, Jeff? (it's closing a lot of issues for 0.19.0 :-))

@sinhrks sinhrks force-pushed the sparse_dtype3 branch 2 times, most recently from c040583 to 38c6661 Compare August 29, 2016 06:56
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 29, 2016

@sinhrks Can you update the docstrings for SparseDataFrame, SparseSeries and SparseArray? They all still mention the fact that only floats are supported or that nan is the default fill value.

@jorisvandenbossche jorisvandenbossche merged commit b6d3a81 into pandas-dev:master Aug 31, 2016
@jorisvandenbossche
Copy link
Member

@sinhrks Thanks a lot!

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 31, 2016

@sinhrks appveyor started failing (some int dtype issues):

======================================================================
FAIL: test_append_zero (pandas.sparse.tests.test_list.TestSparseList)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\sparse\tests\test_list.py", line 64, in test_append_zero
    tm.assert_sp_array_equal(sparr, SparseArray(arr, fill_value=0))
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1392, in assert_sp_array_equal
    assert_numpy_array_equal(left.sp_values, right.sp_values)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1083, in assert_numpy_array_equal
    assert_attr_equal('dtype', left, right, obj=obj)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 878, in assert_attr_equal
    left_attr, right_attr)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1018, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: numpy array are different
Attribute "dtype" are different
[left]:  int64
[right]: int32

@sinhrks
Copy link
Member Author

sinhrks commented Sep 1, 2016

@jorisvandenbossche thx for pointing out, will fix.

@sinhrks sinhrks deleted the sparse_dtype3 branch September 1, 2016 01:33
@kawochen kawochen mentioned this pull request Dec 5, 2016
18 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment